COVID-19 has already had an immense effect globally, and million of people are contaminated in more than 218 nations. Here I tried to analysis the covid cases in Malaysia. This data set contains data of Covid Cases in different states. Where stated the death rate, growth of active cases in differnt state. This dataset is collected for the purpose of creating better visualizations for the COVID-19 cases in Malaysia according to the states.
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import os
import datetime
Data Directory Setup.
os.chdir('F:/SEM 06/BIG DATA/IND PROJECT/DATASET/')
Covid_MY = pd.read_csv('all_2020-03-27_2021-04-15.csv')
Covid_St = pd.read_csv('state_all.csv')
Covid_St_Cum = pd.read_csv('state_cumu.csv')
Check whether data successfully read or not!
Covid_MY.head(5)
| Date | Recovered | Cumulative Recovered | Imported Case | Local Case | Active Case | New Case | Cumulative Case | ICU | Ventilator | Death | Cumulative Death | URL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27-03-20 | 44 | 259 | NaN | NaN | NaN | 130 | 2161 | 54 | 34 | 3 | 26 | https://kpkesihatan.com/2020/03/27/kenyataan-a... |
| 1 | 28-03-20 | 61 | 320 | NaN | NaN | NaN | 159 | 2320 | 73 | 54 | 1 | 27 | https://kpkesihatan.com/2020/03/28/kenyataan-a... |
| 2 | 29-03-20 | 68 | 388 | NaN | NaN | NaN | 150 | 2470 | 73 | 52 | 7 | 34 | https://kpkesihatan.com/2020/03/29/kenyataan-a... |
| 3 | 30-03-20 | 91 | 479 | NaN | NaN | NaN | 156 | 2626 | 94 | 62 | 3 | 37 | https://kpkesihatan.com/2020/03/30/kenyataan-a... |
| 4 | 31-03-20 | 58 | 537 | NaN | NaN | NaN | 140 | 2766 | 94 | 60 | 6 | 43 | https://kpkesihatan.com/2020/03/31/kenyataan-a... |
Covid_St.head(5)
| Date | JOHOR | KEDAH | KELANTAN | MELAKA | NEGERI SEMBILAN | PAHANG | PERAK | PERLIS | PULAU PINANG | SABAH | SARAWAK | SELANGOR | TERENGGANU | WP KUALA LUMPUR | WP LABUAN | WP PUTRAJAYA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-03-27 | 20 | 1 | 4 | 0 | 6 | 4 | 8 | 0 | 6 | 10 | 15 | 36 | 2 | 16 | 0 | 2 |
| 1 | 2020-03-28 | 26 | 1 | 10 | 9 | 15 | 14 | 6 | 0 | 6 | 15 | 8 | 33 | 4 | 7 | 5 | 0 |
| 2 | 2020-03-29 | 24 | 1 | 15 | 1 | 9 | 12 | 11 | 0 | 1 | 0 | 11 | 33 | 2 | 28 | 0 | 2 |
| 3 | 2020-03-30 | 24 | 2 | 4 | 7 | 10 | 3 | 8 | 1 | 1 | 4 | 6 | 60 | 0 | 24 | 0 | 2 |
| 4 | 2020-03-31 | 16 | 0 | 4 | 2 | 9 | 3 | 5 | 1 | 6 | 5 | 21 | 32 | 0 | 34 | 0 | 2 |
Covid_St_Cum.head(5)
| Date | JOHOR | KEDAH | KELANTAN | MELAKA | NEGERI SEMBILAN | PAHANG | PERAK | PERLIS | PULAU PINANG | SABAH | SARAWAK | SELANGOR | TERENGGANU | WP KUALA LUMPUR | WP LABUAN | WP PUTRAJAYA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-03-27 | 259 | 73 | 98 | 33 | 138 | 70 | 159 | 10 | 80 | 182 | 110 | 546 | 41 | 337 | 5 | 20 |
| 1 | 2020-03-28 | 285 | 74 | 108 | 42 | 153 | 84 | 165 | 10 | 86 | 197 | 118 | 579 | 45 | 344 | 10 | 20 |
| 2 | 2020-03-29 | 309 | 75 | 123 | 43 | 162 | 96 | 176 | 10 | 87 | 197 | 129 | 612 | 47 | 372 | 10 | 22 |
| 3 | 2020-03-30 | 333 | 77 | 127 | 50 | 172 | 99 | 184 | 11 | 88 | 201 | 135 | 672 | 47 | 396 | 10 | 24 |
| 4 | 2020-03-31 | 349 | 77 | 131 | 52 | 181 | 102 | 189 | 12 | 94 | 206 | 156 | 704 | 47 | 430 | 10 | 26 |
Covid_MY.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 385 entries, 0 to 384 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 385 non-null object 1 Recovered 385 non-null int64 2 Cumulative Recovered 385 non-null int64 3 Imported Case 86 non-null float64 4 Local Case 86 non-null float64 5 Active Case 86 non-null float64 6 New Case 385 non-null int64 7 Cumulative Case 385 non-null int64 8 ICU 385 non-null int64 9 Ventilator 385 non-null int64 10 Death 385 non-null int64 11 Cumulative Death 385 non-null int64 12 URL 385 non-null object dtypes: float64(3), int64(8), object(2) memory usage: 39.2+ KB
#check the Null Values
Covid_MY.isnull().sum()
Date 0 Recovered 0 Cumulative Recovered 0 Imported Case 299 Local Case 299 Active Case 299 New Case 0 Cumulative Case 0 ICU 0 Ventilator 0 Death 0 Cumulative Death 0 URL 0 dtype: int64
#Visualize the Null Values
sns.heatmap(Covid_MY.isnull(),yticklabels=False,cbar=False,cmap='YlGnBu')
<AxesSubplot:>
As we can see three column has nul value. So I've converted all null values for 2 columnn to zero. And drop the wanted column.
#Convert Null values to 0
Covid_MY['Imported Case'] = Covid_MY['Imported Case'].fillna(0)
Covid_MY['Local Case'] = Covid_MY['Local Case'].fillna(0)
#Delete unwanted column because we don't need "URL" for analysis purposes
Covid_MY = Covid_MY.drop("URL", axis=1)
Covid_MY.head(5)
| Date | Recovered | Cumulative Recovered | Imported Case | Local Case | Active Case | New Case | Cumulative Case | ICU | Ventilator | Death | Cumulative Death | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27-03-20 | 44 | 259 | 0.0 | 0.0 | NaN | 130 | 2161 | 54 | 34 | 3 | 26 |
| 1 | 28-03-20 | 61 | 320 | 0.0 | 0.0 | NaN | 159 | 2320 | 73 | 54 | 1 | 27 |
| 2 | 29-03-20 | 68 | 388 | 0.0 | 0.0 | NaN | 150 | 2470 | 73 | 52 | 7 | 34 |
| 3 | 30-03-20 | 91 | 479 | 0.0 | 0.0 | NaN | 156 | 2626 | 94 | 62 | 3 | 37 |
| 4 | 31-03-20 | 58 | 537 | 0.0 | 0.0 | NaN | 140 | 2766 | 94 | 60 | 6 | 43 |
Covid_MY.describe().transpose()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Recovered | 385.0 | 906.036364 | 1225.341164 | 1.0 | 36.00 | 169.0 | 1277.0 | 5718.0 |
| Cumulative Recovered | 385.0 | 77008.602597 | 108007.407488 | 259.0 | 8375.00 | 10340.0 | 106832.0 | 349039.0 |
| Imported Case | 385.0 | 1.685714 | 4.173631 | 0.0 | 0.00 | 0.0 | 0.0 | 32.0 |
| Local Case | 385.0 | 524.563636 | 1122.742551 | 0.0 | 0.00 | 0.0 | 0.0 | 5723.0 |
| Active Case | 86.0 | 28599.081395 | 14310.913262 | 14097.0 | 14787.75 | 23593.0 | 42674.0 | 52186.0 |
| New Case | 385.0 | 950.509091 | 1178.036294 | 1.0 | 25.00 | 354.0 | 1472.0 | 5728.0 |
| Cumulative Case | 385.0 | 88323.384416 | 117503.173186 | 2161.0 | 8640.00 | 12813.0 | 133559.0 | 367977.0 |
| ICU | 385.0 | 88.332468 | 86.453285 | 0.0 | 8.00 | 83.0 | 147.0 | 327.0 |
| Ventilator | 385.0 | 38.441558 | 38.073563 | 0.0 | 4.00 | 30.0 | 64.0 | 145.0 |
| Death | 385.0 | 3.480519 | 4.518511 | 0.0 | 0.00 | 2.0 | 5.0 | 25.0 |
| Cumulative Death | 385.0 | 395.493506 | 399.421757 | 26.0 | 121.00 | 137.0 | 542.0 | 1363.0 |
Here is the short data summary. And I have used a formula to find the "Active Cases". So the formula is quite simple. Active Cases = Cumulative Case - (Cumulative Recovered + Cumulative Death)
# Calculate Active Case = Cumulative Case - (Cumulative Recovered + Cumulative Death)
Covid_MY['Active Case'] = (Covid_MY['Cumulative Case'] - (Covid_MY['Cumulative Recovered'] + Covid_MY['Cumulative Death']))
Covid_MY.head(5)
| Date | Recovered | Cumulative Recovered | Imported Case | Local Case | Active Case | New Case | Cumulative Case | ICU | Ventilator | Death | Cumulative Death | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27-03-20 | 44 | 259 | 0.0 | 0.0 | 1876 | 130 | 2161 | 54 | 34 | 3 | 26 |
| 1 | 28-03-20 | 61 | 320 | 0.0 | 0.0 | 1973 | 159 | 2320 | 73 | 54 | 1 | 27 |
| 2 | 29-03-20 | 68 | 388 | 0.0 | 0.0 | 2048 | 150 | 2470 | 73 | 52 | 7 | 34 |
| 3 | 30-03-20 | 91 | 479 | 0.0 | 0.0 | 2110 | 156 | 2626 | 94 | 62 | 3 | 37 |
| 4 | 31-03-20 | 58 | 537 | 0.0 | 0.0 | 2186 | 140 | 2766 | 94 | 60 | 6 | 43 |
# Changing the format of date for anaysing better
Covid_MY['Date'] = pd.to_datetime(Covid_MY['Date'], format='%d-%m-%y')
Covid_MY.set_index('Date', inplace=True)
Covid_MY.head(5)
| Recovered | Cumulative Recovered | Imported Case | Local Case | Active Case | New Case | Cumulative Case | ICU | Ventilator | Death | Cumulative Death | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||
| 2020-03-27 | 44 | 259 | 0.0 | 0.0 | 1876 | 130 | 2161 | 54 | 34 | 3 | 26 |
| 2020-03-28 | 61 | 320 | 0.0 | 0.0 | 1973 | 159 | 2320 | 73 | 54 | 1 | 27 |
| 2020-03-29 | 68 | 388 | 0.0 | 0.0 | 2048 | 150 | 2470 | 73 | 52 | 7 | 34 |
| 2020-03-30 | 91 | 479 | 0.0 | 0.0 | 2110 | 156 | 2626 | 94 | 62 | 3 | 37 |
| 2020-03-31 | 58 | 537 | 0.0 | 0.0 | 2186 | 140 | 2766 | 94 | 60 | 6 | 43 |
corr = Covid_MY.corr()
plt.figure(figsize=(10,5))
sns.heatmap(corr, annot = True, cmap="YlGnBu")
<AxesSubplot:>
Necessary Library for visualization purposes
import plotly
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
figure1 = px.scatter(Covid_MY,title = 'Daily New Cases in Malaysia by Covid 19',x=None,y= 'New Case',color = 'New Case' )
figure1
figure2 = px.scatter(Covid_MY,title = 'Daily recover Cases in Malaysia by Covid 19',x=None,y= 'Recovered',color = 'Recovered' )
figure2
figure3 = px.scatter(Covid_MY,title = 'Daily Death Cases in Malaysia by Covid 19',x=None,y= 'Death',color = 'Death' )
figure3
From the above graphs we can see the most new cases got in 30th january 2021, Which is exactly 5728. On the other hand, most recovered cases was in 17 February 2021, Which is around 57 hundred. In contrast, Most death case was in 18th February that was 25 cases.
#Here I have create a function to vasualize the Daily Cases, Recovered And Death.
figure4 = go.Figure()
#This function is for line graphs of the New Cases
figure4.add_trace(go.Scatter(x=Covid_MY.index, y=Covid_MY['Cumulative Case'],line=dict(color='blue'),name='Cumulative Cases'))
#This function is for line graphs of the Recovered Cases
figure4.add_trace(go.Scatter(x=Covid_MY.index, y=Covid_MY['Cumulative Recovered'],line=dict(color='green'),name='Cumulative Recovered'))
#This function is for line graphs of the Death Cases
figure4.add_trace(go.Scatter(x=Covid_MY.index, y=Covid_MY['Cumulative Death'],line=dict(color='red'),name='Cumulative Death'))
figure4.update_layout( title='Daily Total Cases of COVID-19 in Malaysia: ',
hovermode="x",
spikedistance=500, #This line is for showing the distance of spike
xaxis=dict(
linecolor="#FFFFFF",
showspikes=True, # This line is for showing spike line for X-axis
# Format spike
spikethickness=3,
spikedash="dot",
spikecolor="#000000",
spikemode="toaxis+across",
)
)
figure4.show()
This graphs represent the total growth rate covid 19 where it crearly stated the Cumulative Cases, Cumulative Recover, New death per day. As we can see the growth of total cases and recover cases have gradually increased in parallel way. On the contrary, the growth of total death cases approximately on the bottom line.
#Here I create a funtion to find the average "Newcases, Death and Recover per month"
covid_mnth = Covid_MY.resample("M").sum()
covid_mnth.reset_index(inplace=True)
covid_mnth.Date = covid_mnth.Date.apply(lambda dt: dt.replace(day=15)) # I set the value at day 15 which is middle of the month
covid_mnth = covid_mnth[['Date', 'Recovered', 'New Case', 'Death']] # Here I've select the specific column to analyse better.
covid_mnth.set_index('Date', inplace=True)
covid_mnth .head(5)
| Recovered | New Case | Death | |
|---|---|---|---|
| Date | |||
| 2020-03-15 | 322 | 735 | 20 |
| 2020-04-15 | 3634 | 3236 | 59 |
| 2020-05-15 | 2182 | 1817 | 13 |
| 2020-06-15 | 2001 | 820 | 6 |
| 2020-07-15 | 290 | 337 | 4 |
V_mnth = pd.melt(covid_mnth[['New Case', 'Recovered', 'Death']],
var_name='Case', value_name='Number',
ignore_index=False).reset_index()
V_mnth['log_number'] = np.log10(V_mnth['Number']).round(4)
V_mnth.head()
| Date | Case | Number | log_number | |
|---|---|---|---|---|
| 0 | 2020-03-15 | New Case | 735 | 2.8663 |
| 1 | 2020-04-15 | New Case | 3236 | 3.5100 |
| 2 | 2020-05-15 | New Case | 1817 | 3.2594 |
| 3 | 2020-06-15 | New Case | 820 | 2.9138 |
| 4 | 2020-07-15 | New Case | 337 | 2.5276 |
#The purpose of this function is for graphical representation for monthly covid cases in Malaysia.
figure5 = px.bar(V_mnth, x='Date', y='Number', color='Case', text='Number',color_discrete_sequence=['blue', 'green', 'red'])
#This the main layout function Where i grouped the bar among NewCases, Recovered, and Death
figure5.update_layout(title_text='Monthly Cases of COVID-19 in Malaysia:',
xaxis_title="Month : Mar,2020 - April,2021", yaxis_title='Count',uniformtext_minsize=5, barmode='group')
figure5.update_yaxes(type='log')
figure5.update_xaxes(dtick="M1", tickformat="%b\n%Y")
figure5.show(config={"displayModeBar": False, "showTips": False})
This figure 5 shows the monthly cumulative active cases, death cases and recovered case from march 2020 to April 2021. Where we can see the growth of transmissing weere continiously increasing after Septembar 2020. And pepole mostly affected by covid 19 on january and February. But it is noticeable that the recovery rate was higher than the death rate.
# Here i want to show the graphical represtation of transmission of Covid among the states.
#So I've choosen subplot grap for analysing better
fig, axes = plt.subplots(nrows=8, ncols=2)
Covid_St.plot(ax=axes[0,0], x='Date', y='JOHOR', figsize=(30, 36))
Covid_St.plot(ax=axes[0,1], x='Date', y='KEDAH', figsize=(30, 36))
Covid_St.plot(ax=axes[1,0], x='Date', y='KELANTAN', figsize=(30, 36))
Covid_St.plot(ax=axes[1,1], x='Date', y='MELAKA', figsize=(30, 36))
Covid_St.plot(ax=axes[2,0], x='Date', y='NEGERI SEMBILAN', figsize=(30, 36))
Covid_St.plot(ax=axes[2,1], x='Date', y='PAHANG', figsize=(30, 36))
Covid_St.plot(ax=axes[3,0], x='Date', y='PERAK', figsize=(30, 36))
Covid_St.plot(ax=axes[3,1], x='Date', y='PERLIS', figsize=(30, 36))
Covid_St.plot(ax=axes[4,0], x='Date', y='PULAU PINANG', figsize=(30, 36))
Covid_St.plot(ax=axes[4,1], x='Date', y='SABAH', figsize=(30, 36))
Covid_St.plot(ax=axes[5,0], x='Date', y='SARAWAK', figsize=(30, 36))
Covid_St.plot(ax=axes[5,1], x='Date', y='SELANGOR', figsize=(30, 36))
Covid_St.plot(ax=axes[6,0], x='Date', y='TERENGGANU', figsize=(30, 36))
Covid_St.plot(ax=axes[6,1], x='Date', y='WP KUALA LUMPUR', figsize=(30, 36))
Covid_St.plot(ax=axes[7,0], x='Date', y='WP LABUAN', figsize=(30, 36))
Covid_St.plot(ax=axes[7,1], x='Date', y='WP PUTRAJAYA', figsize=(30, 36))
<AxesSubplot:xlabel='Date'>
This figure 6 shows the growth rate of Covid 19 cases among the States. As we can see the trasmission of covid 19 gradually increased and after the first month of 2021 the transmission is decreased. But In state Sarawak the transmission was in steady mood.
#This necessary library need to import
from datetime import datetime, timedelta
#Creating a new dataframe for calculate the last date
Covid_St.Date = pd.to_datetime(Covid_St.Date)
Covid_St_Cum.Date = pd.to_datetime(Covid_St_Cum.Date)
Covid_St.set_index('Date', inplace=True)
Covid_St_Cum.set_index('Date', inplace=True)
#Here i have select the last date using iloc.
last_date = Covid_MY.iloc[-1].name
last_date = last_date.strftime('%b %d, %Y')
Confirmed_Total = Covid_St_Cum.iloc[[-1]].T.reset_index()
Confirmed_Total.columns = ['State', 'Confirmed']
Confirmed_Total['State_spaced'] = Confirmed_Total['State'] + ' '
figure7 = px.bar(Confirmed_Total.sort_values('Confirmed'), x='State_spaced', y='Confirmed', text='Confirmed', color='Confirmed',
color_continuous_scale='rainbow')
figure7.update_layout(uniformtext_minsize=10, uniformtext_mode='hide',
title=f'Total COVID-19 Cases by {last_date} in Malaysia ',
width=900, height=800,
xaxis_title='States', yaxis_title='Total Count',
showlegend=False, coloraxis_showscale=False)
figure7.update_traces(texttemplate='%{text:,}')
figure7.show(config={"displayModeBar": False, "showTips": False})
Figure 7 Shows the total Covid cases by April 15, 2021. From this graphs, we can clearly state that Selangor is the most affected by Covid 19. The cumulative cases for selangor is approximately 121817 Person. And lowest transmission was in state Perlis.
figure8 = px.scatter(Covid_St,title = 'Daily Cumulative Active Cases of Covid 19 in State Selangor',x=None,y= 'SELANGOR',color = 'SELANGOR' )
figure8
figure9 = go.Figure()
#This function is for line graphs of the New Cases
figure9.add_trace(go.Scatter(x=Covid_St.index, y=Covid_St['SELANGOR'],line=dict(color='firebrick'),name='SELANGOR'))
#This function is for line graphs of the Recovered Cases
figure9.add_trace(go.Scatter(x=Covid_St.index, y=Covid_St['SABAH'],line=dict(color='skyblue'),name='SABAH'))
#This function is for line graphs of the Death Cases
figure9.add_trace(go.Scatter(x=Covid_St.index, y=Covid_St['JOHOR'],line=dict(color='teal'),name='JOHOR'))
figure9.add_trace(go.Scatter(x=Covid_St.index, y=Covid_St['WP KUALA LUMPUR'],line=dict(color='rebeccapurple'),name='WP KUALA LUMPUR'))
figure9.update_layout( title='Among the 4 state where transmission of Covid was faster ',
hovermode="x",
spikedistance=500, #This line is for showing the distance of spike
xaxis=dict(
linecolor="#FFFFFF",
showspikes=True, # This line is for showing spike line for X-axis
# Format spike
spikethickness=3,
spikedash="dot",
spikecolor="#000000",
spikemode="toaxis+across",
)
)
figure9.show()
In figure 9 shows the transmission of covid 19 among the 4 state. However, from the figure 9 we can see that Kuala Lampur and the selangor has the faster transmission between the other two.
To sum up, The purpose of this analysis is to estimate the transmission rate of new Covid-19 cases in future on the upcoming days. If we see the statistics, we can see the increment of COVID-19 is significantly rising. By using the dataset of Malaysia covid-19 cases, selecting a some state for calculation and analisation puposes, I came up with some acceptable results which are interesting as well. Although this is not the 100% accurate analysis because there is a shortage of resource and data. However, for my analysis I just give a concept using some sort of prediction and calculation, which could help people in order to maintain a safe and healthy lifestyle in the future.